is the reconstruction error, which is assumed to obey a Gaussian prior with zero mean and variance $\nu$. Under the most probable $y$ (corresponding to $y = 0$ and $x = w^{-1} \circ \hat{x}$, i.e., the minimum reconstruction error), we maximize $p(x|y)$ to optimize $x$ for quantization (e.g., 1-bit CNNs) as:
\[
\max \; p(x|y), \tag{3.98}
\]
which can be solved based on Bayesian learning, which uses Bayes' theorem to determine the conditional probability of a hypothesis given limited observations. We note that the computation in BNNs is still based on optimizing $x$, as shown in Fig. 3.19, where the binarization is performed with the sign function. Equation 3.98 is complicated and difficult to solve due to the unknown $w^{-1}$, as shown in Eq. 3.97. From a Bayesian learning perspective, we resolve this problem via maximum a posteriori (MAP) estimation:
\[
\max \; p(x|y) = \max \; p(y|x)\,p(x) = \min \; \|\hat{x} - w \circ x\|_2^2 - 2\nu \log\big(p(x)\big), \tag{3.99}
\]
where
\[
p(y|x) \propto \exp\Big(-\frac{1}{2\nu}\,\|y\|_2^2\Big) \propto \exp\Big(-\frac{1}{2\nu}\,\|\hat{x} - w \circ x\|_2^2\Big). \tag{3.100}
\]
In Eq. 3.100, we assume that all components of the quantization error $y$ are i.i.d., which yields the simplified form. As shown in Fig. 3.19, for 1-bit CNNs, $x$ is usually quantized to two numbers with the same absolute value. We neglect the overlap between the two numbers, and thus $p(x)$ is modeled as a Gaussian mixture with two modes:
\[
\begin{aligned}
p(x) = {} & \frac{1}{2}\,(2\pi)^{-\frac{N}{2}} \det(\Psi)^{-\frac{1}{2}}
\bigg[ \exp\Big(-\frac{(x-\mu)^T \Psi^{-1}(x-\mu)}{2}\Big)
     + \exp\Big(-\frac{(x+\mu)^T \Psi^{-1}(x+\mu)}{2}\Big) \bigg] \\
\approx {} & \frac{1}{2}\,(2\pi)^{-\frac{N}{2}} \det(\Psi)^{-\frac{1}{2}}
\bigg[ \exp\Big(-\frac{(x^{+}-\mu^{+})^T \Psi_{+}^{-1}(x^{+}-\mu^{+})}{2}\Big)
     + \exp\Big(-\frac{(x^{-}+\mu^{-})^T \Psi_{-}^{-1}(x^{-}+\mu^{-})}{2}\Big) \bigg],
\end{aligned} \tag{3.101}
\]
where $x$ is divided into $x^{+}$ and $x^{-}$ according to the signs of its elements, and $N$ is the dimension of $x$. Accordingly, Eq. 3.99 can be rewritten as:
\[
\min \; \|\hat{x} - w \circ x\|_2^2
+ \nu\,(x^{+}-\mu^{+})^T \Psi_{+}^{-1}(x^{+}-\mu^{+})
+ \nu\,(x^{-}+\mu^{-})^T \Psi_{-}^{-1}(x^{-}+\mu^{-})
+ \nu \log\big(\det(\Psi)\big), \tag{3.102}
\]
where $\mu^{-}$ and $\mu^{+}$ are solved independently, and $\det(\Psi)$ is accordingly set to the determinant of the matrix $\Psi_{-}$ or $\Psi_{+}$. We call Eq. 3.102 the Bayesian kernel loss.
Bayesian feature loss: We also design a Bayesian feature loss to alleviate the disturbance caused by the extreme quantization process in 1-bit CNNs. Considering intra-class compactness, the features $f_m$ of the $m$-th class are assumed to follow a Gaussian distribution with mean $c_m$, as revealed in the center loss [245]. Similar to the Bayesian kernel loss, we define $y_f^m = f_m - c_m$ with $y_f^m \sim \mathcal{N}(0, \sigma_m)$, and we have:
\[
\min \; \|f_m - c_m\|_2^2
+ \sum_{n=1}^{N_f} \Big[ \sigma_{m,n}^{-2}\,(f_{m,n}-c_{m,n})^2 + \log(\sigma_{m,n}^2) \Big], \tag{3.103}
\]
which is called the Bayesian feature loss. In Eq. 3.103, $\sigma_{m,n}$, $f_{m,n}$, and $c_{m,n}$ are the $n$-th elements of $\sigma_m$, $f_m$, and $c_m$, respectively. We take the latent distributions of kernel weights and features into consideration within the same framework and introduce the Bayesian losses to improve the capacity of 1-bit CNNs.
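As a companion to Eq. 3.103, a small NumPy sketch of the Bayesian feature loss for one sample of class $m$ is given below. The argument names (f for $f_m$, c for the learned center $c_m$, sigma for $\sigma_m$) and the assumption that all three are 1-D arrays of length $N_f$ are illustrative; in practice $c_m$ and $\sigma_m$ would be learnable parameters updated jointly with the network.

```python
import numpy as np

def bayesian_feature_loss(f, c, sigma):
    """Illustrative sketch of the Bayesian feature loss in Eq. 3.103."""
    f, c, sigma = (np.asarray(a, dtype=float) for a in (f, c, sigma))

    # Intra-class compactness term ||f_m - c_m||_2^2.
    center = np.sum((f - c) ** 2)

    # Variance-weighted term sigma_{m,n}^{-2} (f_{m,n} - c_{m,n})^2 plus the
    # log-variance regularizer log(sigma_{m,n}^2), summed over n = 1..N_f.
    weighted = np.sum((f - c) ** 2 / sigma ** 2)
    log_var = np.sum(np.log(sigma ** 2))

    return center + weighted + log_var
```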